Introduction to Open Data Science - Course Project

About the project

I have learned about the course Introduction to Open Data Science through my PhD studies. I’m exited to start the course but also a bit worried about the workload. The course Moodle page seems to contain a lot of information, so probably will take a while to get a handle on things.

I have previously used R and Rstudio on some courses, but I’m no expert. I think that exercise 1 was too long, and all in all there was too much stuff to be learned about R in the first week. To me, this one really long exercise feels exhausting. It is nice that there are the R codes already embedded, but I feel smaller chunks would be better. If I hadn’t done R before, I would be terrified right now.

Link to my GitHub repository


Regression and model validation

date()
## [1] "Thu Dec 01 16:44:18 2022"

Basic information about dataset

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5      v purrr   0.3.4 
## v tibble  3.1.2      v dplyr   1.0.10
## v tidyr   1.1.3      v stringr 1.4.0 
## v readr   1.4.0      v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
students2014 <- read_csv("./data/learning2014.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   gender = col_character(),
##   age = col_double(),
##   attitude = col_double(),
##   deep = col_double(),
##   stra = col_double(),
##   surf = col_double(),
##   points = col_double()
## )
dim(students2014)
## [1] 166   7
str(students2014)
## spc_tbl_ [166 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )

Dataset is created from data: https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-meta.txt
Data is collected from students in course ‘Johdatus yhteiskuntatilastotieteeseen’ in fall 2014.

Dataset students2014 contains 7 variables with 166 observations: gender, age, attitude, deep, stra, surf, and points.

All variables in the original data related to strategic, deep, and surface learning have been combined into three variables (stra, deep, and surf, respectively), and scaled to original scales by taking the mean.

Attitude tells about student’s global attitude towards statistics. Variable has been scaled back to the original scale of the questions, by dividing it with the number of questions.

Points tells the course exam points. Observations with points equaling 0 have been excluded, since those students didn’t attend the course exam.

Describing the data

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a plot matrix with ggpairs()
p <- ggpairs(students2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

summary(factor(students2014$gender))
##   F   M 
## 110  56
summary(students2014$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   21.00   22.00   25.51   27.00   55.00
summary(students2014$attitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   2.600   3.200   3.143   3.700   5.000
summary(students2014$deep)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   3.333   3.667   3.680   4.083   4.917
summary(students2014$stra)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.250   2.625   3.188   3.121   3.625   5.000
summary(students2014$surf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   2.417   2.833   2.787   3.167   4.333
summary(students2014$points)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   19.00   23.00   22.72   27.75   33.00

There are 66% (110/166) women in respondents. The median age is 22 years (range 17-55), the median age of women seems to be a bit lower than that of men.

The median attitude is 3.2 (range 1.4-5). The median attitude of men seems higher than that of women.

The median for deep learning is 3.7 (range 1.6-4.9). The median for strategic learning is 3.1 (range 1.3-5). The median for surface learning is 2.8 (range 1.6-4.3). For women, the median for strategic and surface learning seems a bit higher than for men.

The median for exam points is 23 (range 7-33). The median is about the same in men and women.

Attitude correlates with exam points, so that with “better” attitude towards statistics one usually gets higher exam points.

Surface learning correlates negatively with deep and strategic learning as well as attitude. So with better attitude one less often utilizes surface learning. If one utilizes surface learning, one less often utilizes deep or strategic learning.

Interpreting the fitted regression model

#fit a linear regression model using points as the outcome and attitude, strategig and surface learning as explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
#remove surface learning from the model, since it's so far from significant
my_model2 <- lm(points ~ attitude + stra, data = students2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
#error rate
sigma(my_model2)*100/mean(students2014$points)
## [1] 23.28148

First we choose three variables that correlate the most with points in above visualization (namely attitude, strategic and surface learning). However, after fitting the model, we see that for surface learning there is a really high p-value (0.466), so we remove that and fit the model again with just attitude and strategic learning.

In the new model, p-value of the F-statistic is 7.734e-09. So at least one of the explanatory variables is significantly associated with exam points.

For attitude, p-value is <0.001, so it is significantly associated with exam points. According to this model, if attitude increases by 1, exam points increase by approximately 3.5.

For strategic learning, p-value is 0.089. It isn’t <0.05 that’s usually deemed significant, however, it is quite close. So we probably don’t want to state that strategic learning has no effect on exam points. On the other hand, the estimated intercept for strategic learning is 0.9, so a change in attitude makes a much bigger difference than a change in strategic learning.

The adjusted R-squared is 0.1951, which means that attitude (and strategic learning) only explain around 20% of the variance in exam points.

The residual standard error is 5.289 which corresponds to 23% prediction error rate.

Assumptions of the model

plot(my_model2, which=c(1,2,5))

Linear regression assumes a linear relationship between predictors and outcome, that residual errors are normally distributed, that residuals have a constant variance, and independence of residual error terms.

With Residuals vs. Fitted plot we can check the linearity of data. Here, the red line seems to be horizontal at approximately zero, where it should be. So we can assume a linear relationship.

With plot Normal Q-Q we can check if residual errors are normally distributed. The plot of residuals should more or less follow the dashed line. Here they mostly do so, with some exceptions at the upper and lower ends. However, we may assume normality.

The Residuals vs. Leverage plot marks three most extreme points (35, 71, 145). Two of them exceed 3 standard deviations, so they are possible outliers.


Logistic regression

date()
## [1] "Thu Dec 01 16:44:29 2022"

Short description of dataset

library(tidyverse)
alc <- read_csv("./data/alc.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_character(),
##   age = col_double(),
##   Medu = col_double(),
##   Fedu = col_double(),
##   traveltime = col_double(),
##   studytime = col_double(),
##   famrel = col_double(),
##   freetime = col_double(),
##   goout = col_double(),
##   Dalc = col_double(),
##   Walc = col_double(),
##   health = col_double(),
##   failures = col_double(),
##   absences = col_double(),
##   G1 = col_double(),
##   G2 = col_double(),
##   G3 = col_double(),
##   alc_use = col_double(),
##   high_use = col_logical()
## )
## i Use `spec()` for the full column specifications.
dim(alc)
## [1] 370  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Dataset is created from data: https://archive.ics.uci.edu/ml/datasets/Student+Performance
The data approaches student achievement in secondary education of two Portuguese schools, and the dataset joins two student alcohol consumption datasets. The variables not used for joining the two data have been combined by averaging (including the grade variables). ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’. ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise.

Dataset alc contains 35 variables with 370 observations.

My personal hypothesis about relationships of with alcohol consumption

I think interesting variables in relation to alcohol consumption are sex, number of past class failures, final grade (G3) and number of school absences.
My hypothesis for these variables are:
1. Men’s alcohol consumption is higher than women’s.
2. Those who consume more alcohol have more past class failures compared to those who consume less alcohol.
3. Those who consume more alcohol have lower final grades compared to those who consume less alcohol.
4. Those who consume more alcohol have more school absences compared to those who consume less alcohol.

Description of chosen variables and their relationships with alcohol consumption

library(dplyr); library(ggplot2); library(GGally); library(finalfit)


dependent <- "high_use"
explanatory <- c("sex", "failures", "G3", "absences")
alc %>% 
  summary_factorlist(dependent, explanatory, p = TRUE,
                     add_dependent_label = TRUE)
## Warning in chisq.test(failures, high_use): Chi-squared approximation may be
## incorrect
##  Dependent: high_use                FALSE       TRUE      p
##                  sex         F 154 (59.5)  41 (36.9) <0.001
##                              M 105 (40.5)  70 (63.1)       
##             failures         0 238 (91.9)  87 (78.4)  0.003
##                              1   12 (4.6)  12 (10.8)       
##                              2    8 (3.1)    9 (8.1)       
##                              3    1 (0.4)    3 (2.7)       
##                   G3 Mean (SD) 11.8 (3.4) 10.9 (3.0)  0.011
##             absences Mean (SD)  3.7 (4.5)  6.4 (7.1) <0.001
#median and quartiles of the numeric variables
summary(alc$failures)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1892  0.0000  3.0000
summary(alc$G3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   12.00   11.52   14.00   18.00
summary(alc$absences)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   4.511   6.000  45.000
#draw a barplot of alcohol consumption by sex
ggplot(data = alc, aes(x = high_use, fill=sex)) + geom_bar()

#produce summary statistics of final grade by alcohol consumption and sex
alc %>% group_by(sex, high_use) %>% summarise(mean_grade=mean(G3),count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use mean_grade count
##   <chr> <lgl>         <dbl> <int>
## 1 F     FALSE          11.4   154
## 2 F     TRUE           11.8    41
## 3 M     FALSE          12.3   105
## 4 M     TRUE           10.3    70
#draw a boxplot of final grade by alcohol consumption and sex
ggplot(alc, aes(x = high_use, y = G3, col=sex)) + geom_boxplot() + ylab("grade")

#produce summary statistics of school abscences by alcohol consumption and sex
alc %>% group_by(sex, high_use) %>% summarise(mean_absences=mean(absences),count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use mean_absences count
##   <chr> <lgl>            <dbl> <int>
## 1 F     FALSE             4.25   154
## 2 F     TRUE              6.85    41
## 3 M     FALSE             2.91   105
## 4 M     TRUE              6.1     70
#draw a boxplot of school absences by alcohol consumption and sex
ggplot(alc, aes(x = high_use, y = absences, col=sex)) + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")

There are 195 women and 175 men. Of those, 21% (41/195) and 40% (70/175) have high alcohol consumption, respectively. This supports my hypothesis of men consuming more alcohol than women.

The mean of past class failures is 0.2 (range 0-3). Of high users, 3% (3/111) have 3 failures, 8% (9/111) have 2 failures, and 11% have 1 failure. Of others, 0.4% (1/259) have 3 failures, 3% (8/259) have 2 failures, and 5% have 1 failure. Thus it would seem that students with high alcohol consumption have failed class more often than those with low alcohol consumption.

The mean final grade is 11.5 (range 0-18). Men with high alcohol consumption seem to have lower final grades than people with low alcohol consumption. Interestingly, women with high alcohol consumption seem to have around the same final grades than people with low alcohol consumption. So my hypothesis is supported only by observations of men.

The mean number of absences is 4.5 (range 0-45, median 3). The median of school absences is higher for men with high alcohol consumption than other groups. However, if we look at the mean, women with high alcohol consumption have the highest mean of absences, and both men and women with high alcohol consumption have higher mean than those with low alcohol consumption. This is somewhat in line with my hypothesis.

Logistic regression

# fit a logistic regression model 
my_model <- glm(high_use ~ sex + failures + absences + G3, data = alc, family = "binomial")

# print out a summary of the model
summary(my_model)
## 
## Call:
## glm(formula = high_use ~ sex + failures + absences + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1561  -0.8429  -0.5872   1.0033   2.1393  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.38733    0.51617  -2.688  0.00719 ** 
## sexM         1.00870    0.24798   4.068 4.75e-05 ***
## failures     0.50382    0.22018   2.288  0.02213 *  
## absences     0.09058    0.02322   3.901 9.56e-05 ***
## G3          -0.04671    0.03948  -1.183  0.23671    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 405.59  on 365  degrees of freedom
## AIC: 415.59
## 
## Number of Fisher Scoring iterations: 4
# fit a new model without G3 model 
my_model2 <- glm(high_use ~ sex + failures + absences, data = alc, family = "binomial")

# print out a summary of the model
summary(my_model2)
## 
## Call:
## glm(formula = high_use ~ sex + failures + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1550  -0.8430  -0.5889   1.0328   2.0374  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.94150    0.23129  -8.394  < 2e-16 ***
## sexM         0.99731    0.24725   4.034 5.49e-05 ***
## failures     0.59759    0.20698   2.887  0.00389 ** 
## absences     0.09245    0.02323   3.979 6.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 406.99  on 366  degrees of freedom
## AIC: 414.99
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(my_model2) %>% exp

# compute confidence intervals (CI)
CI <- confint(my_model2) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1434892 0.08940913 0.2219242
## sexM        2.7109881 1.67967829 4.4367282
## failures    1.8177381 1.21997630 2.7642305
## absences    1.0968598 1.05041937 1.1508026

In the first model, final grade has a p-value 0.2, and thus isn’t associated with high alcohol consumption. Let’s fit the model again without it. In the second model, all variables are associated with the outcome, and also the AIC drops from 415.59 to 414.99, which indicates that G3 can be dropped from the model.

According to this model, high alcohol consumption is associated with being male (OR 2.7, 95% CI 1.7-4.4, p-value <0.001), having failed class (OR 1.8, 95% CI 1.2-2.8, p-value 0.004), and school absences (OR 1.1, 95% CI 1.1-1.2, p-value <0.001).
My primary hypothesis were correct regarding being male and having failed classes. However, there is only slight association with school absence (OR 1.1), and the final grade doesn’t seem to be associated with high alcohol consumption.

Predictive power of the model

# predict() the probability of high_use
probabilities <- predict(my_model2, type = "response")

library(dplyr)
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252    7
##    TRUE     78   33
# draw a plot of 'high_use' versus 'probability' in 'alc'
ggplot(alc, aes(x = probability, y = high_use)) + geom_point()

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2297297

In the 2x2 table we can see that 252 low and 33 high alcohol consumers are predicted correctly. In the predictions, however, there are 78 false negatives and 7 false positives. According to the training error, 23% of the individuals are classified inaccurately. This is better than tossing a coin (where the chances would be 50/50), but I guess with some background information one could get quite close with educated guesses.
## 10-fold cross-validation of the model

# 10-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = my_model2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2378378
#Let's try the same with another model. 
my_model3 <- glm(high_use ~ traveltime + sex + absences +studytime + goout, data = alc, family = "binomial")
summary(my_model3)
## 
## Call:
## glm(formula = high_use ~ traveltime + sex + absences + studytime + 
##     goout, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8435  -0.7908  -0.4914   0.6936   2.5211  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.72342    0.67103  -5.549 2.88e-08 ***
## traveltime   0.35147    0.18157   1.936 0.052908 .  
## sexM         0.81643    0.27225   2.999 0.002710 ** 
## absences     0.07919    0.02293   3.454 0.000552 ***
## studytime   -0.40974    0.17327  -2.365 0.018041 *  
## goout        0.71830    0.12152   5.911 3.40e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 363.51  on 364  degrees of freedom
## AIC: 375.51
## 
## Number of Fisher Scoring iterations: 4
prob2 <- predict(my_model3, type = "response")
loss_func(class = alc$high_use, prob = prob2)
## [1] 0.2027027
cv2 <- cv.glm(data = alc, cost = loss_func, glmfit = my_model3, K = 10)
cv2$delta[1]
## [1] 0.2135135

The prediction error of the model when using 10-fold cross-validation is around 0.24, which is slightly less than in the exercise set.
With the other tested model (explanatory variables being sex, home to school travel time, number of school absences, weekly study time, and going out with friends), the prediction error with 10-fold cross-validation is around 0.22, which is still lower.


Clustering and classification

date()
## [1] "Thu Dec 01 16:44:34 2022"

Short description of dataset

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data 
data("Boston")

# check structure and dimensions
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The dataset contains housing values in suburbs of Boston. It can be downloaded from R’s MASS package, and contains 506 observations of 14 variables. More information on the data and variables can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

Graphical overview of the data and summaries of the variables

library(tidyr)
# check summary of data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# draw pairs plot of the variables
pairs(Boston)

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
cor_matrix %>% round(2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")

The summary of the data shows means, medians and ranges of the different variables, for example median number of rooms per dwelling is 6.2 (range 3.6-8.8).

Tax rate and accessibility to radial highways seem to be strongly positively correlated. Median value and lower status of the population are negatively correlated. The plotted correlation matrix shows also other positive and negative correlations. The darker blue a sphere is, the stronger the positive correlation between variables, and the darker red a sphere is, the stronger the negative correlation.

Standardizing the dataset

library(dplyr)
# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

boston_scaled$crim <- as.numeric(boston_scaled$crim)

# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, label = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

The variables are now transformed on the same scale, so now we can compare them.

Linear discriminant analysis and predictions

# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2450495 0.2698020 0.2425743 0.2425743 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       1.02710231 -0.9361963 -0.07348562 -0.9019535  0.45166192 -0.8860954
## med_low  -0.09013496 -0.3153459 -0.05560795 -0.5758098 -0.10931800 -0.3418145
## med_high -0.39011035  0.1878223  0.12941585  0.4258435  0.06042757  0.4476419
## high     -0.48724019  1.0171960 -0.03128211  1.0454272 -0.40976782  0.8140249
##                 dis        rad        tax     ptratio       black      lstat
## low       0.9009973 -0.6953345 -0.7440425 -0.45722953  0.38139055 -0.7871298
## med_low   0.3858928 -0.5425035 -0.4567556 -0.03963556  0.31898461 -0.1783893
## med_high -0.3768092 -0.3924029 -0.2943844 -0.32353311  0.05359455  0.0243487
## high     -0.8519138  1.6373367  1.5134896  0.77985517 -0.70113306  0.8562150
##                medv
## low       0.5412905
## med_low  -0.0161358
## med_high  0.1426636
## high     -0.6342003
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.107944522  0.69162436 -1.00475757
## indus    0.007696283 -0.30231329  0.18031083
## chas    -0.075126288 -0.02620317  0.07101621
## nox      0.372969815 -0.71523191 -1.20639445
## rm      -0.096657282 -0.04989508 -0.12170642
## age      0.289955368 -0.37694911 -0.19399392
## dis     -0.125259487 -0.29750432  0.22853043
## rad      3.069135579  0.85299893 -0.22908753
## tax     -0.111014094  0.13649334  0.76570088
## ptratio  0.112468132  0.02735571 -0.19819824
## black   -0.128766760  0.02534500  0.13761030
## lstat    0.177754590 -0.11035798  0.28520541
## medv     0.160530148 -0.28410829 -0.20396590
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9413 0.0432 0.0154
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13      13        2    0
##   med_low    3      10        4    0
##   med_high   0      11       17    0
##   high       0       0        0   29

It would seem that with high crime rate the predictions are most accurate. With low to medium and medium to high there is most inaccuracy.

Distance measures and k-means clustering

library(ggplot2)
data(Boston)
boston_scaled2 <- as.data.frame(scale(Boston))

# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled2, centers = 3)

# plot part of the Boston dataset with clusters
pairs(boston_scaled2[6:10], col = km$cluster)

# determine the maximum number of clusters
k_max <- 10

# calculate the total within sum of squares
set.seed(123)
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled2, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)

# plot part of the Boston dataset with clusters
pairs(boston_scaled2[6:10], col = km$cluster)

From plotting the total within sum of squares we see that around two the value changes quite a lot, so the appropriate number of cluster would be two. The data seems to divide nicely between two clusters according to most variables.

Bonus

data(Boston)
boston_scaled3 <- as.data.frame(scale(Boston))

# k-means clustering
set.seed(123)
km2 <- kmeans(boston_scaled3, centers = 3)

# linear discriminant analysis
lda.fit2 <- lda(km2$cluster ~., data = boston_scaled3)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(km2$cluster)

# plot the lda results
plot(lda.fit2, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit2, myscale = 4)

Variables age, black, and tax seem to be the most influential linear separators for the clusters.

Super-Bonus

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
lda.fit3 <- lda(crime ~., data = train)
model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit3$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit3$scaling
matrix_product <- as.data.frame(matrix_product)

classes <- as.numeric(train$crime)
train2 <- dplyr::select(train, -crime)
set.seed(123)
km3 <- kmeans(train2, centers = 4)
clusters <- as.numeric(km3$cluster)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=classes)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=clusters)

The cluster that is on the left-hand side seems quite similar in both plots and is quite well defined. In the second plot the clusters are more defined, in the first plot the rest of the clusters mix more with each other.


Dimensionality reduction techniques

date()
## [1] "Thu Dec 01 16:44:52 2022"

Short description of dataset

human <- read.table("./data/human.txt", header = TRUE)
dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

The data originates from the United Nations Development Programme
Original data from: http://hdr.undp.org/en/content/human-development-index-hdi

Meta file: https://hdr.undp.org/data-center/human-development-index#/indicies/HDI
Technical notes: https://hdr.undp.org/system/files/documents//technical-notes-calculating-human-development-indices.pdf

Dataset contains 155 observations of 8 variables.

Variable names:
“GNI” = Gross National Income per capita
“Life.Exp” = Life expectancy at birth
“Edu.Exp” = Expected years of schooling
“Mat.Mor” = Maternal mortality ratio
“Ado.Birth” = Adolescent birth rate
“Parli.F” = Percetange of female representatives in parliament
“Edu2.FM” = Proportion of females with at least secondary education / Proportion of males with at least secondary education
“Labo.FM” = Proportion of females in the labour force / Proportion of males in the labour force

Rownames are the countries.

Graphical overview of the data and summaries of the variables

library(GGally)
library(corrplot)
# check summary of data
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# draw pairs plot of the variables
ggpairs(human)

# calculate and visualize the correlation matrix
cor(human) %>% corrplot(., method="circle")

Median life expectancy at birth is 74 years (49-84 years). Median expected years of schooling is 13.5 years (5-20 years). Median gross national income per capita (GNI) is 12 040 (581-123 124). Median maternal mortality ratio is 49 (1-1 100). Median adolescent birth rate is 47 (1-205). Median percentage of female representatives in parliament is 1% (0-58%).

Expected years of schooling correlates positively with life expectancy, GNI, and percentage of female representatives in parliament. It correlates negatively with maternal mortality ratio and adolescent birth rate. Life expectancy also correlates positively with GNI, and negatively with maternal mortality ratio and adolescent birth rate. Maternal mortality ratio correlates negatively with GNI and adolescent birth rate. These findings make a lot of sense.

Principal component analysis (PCA)

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Principal component analysis (PCA) with standardized variables

# standardize the variables
human_std <- scale(human)

# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"))
text(-8, 9, "developed countries", col="blue")
text(10, 3, "developing countries", col="blue")
text(0, -9, "Arab countries", col="blue")

# create and print out a summary of pca_human_std
s <- summary(pca_human_std)

# rounded percentanges of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

# print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3

The results are different, because in the non-standardized data the variables have different scales. Therefore the variances in variables in the non-standardized data are higher. That is also why the first PCA biplot is dominated by GNI (widest scale).

According to the biplot of the standardized dataset, African countries (so-called developing countries) seem to have higher adolescent birt rate and maternal mortality ratio. Many European and other western countries, as well as Australia and Singapore, have higher life expectancy and expected years of schooling, as well as higher proportion of women in parliament. These countries can also be descriped as welfare states or developed countries. In the Arab countries, the proportion of women in parliament is low, as well as proportion of women working compared to men.

The first two principal components of the standardized dataset explain 53.6% and 16.2% of variability, respectively. PC1 is positively correlated with Mat.Mor and Ado.Birth, and negatively correlated with Edu.Exp, Life.Exp, GNI, and Edu2.FM. PC2 is positively correlated with Parli.F and Labo.FM.

# Work with the exercise in this chunk, step-by-step. Fix the R code!
# pca_human, dplyr are available

# create and print out a summary of pca_human
s <- summary(pca_human_std)
s
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# rounded percentanges of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

# print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human_std, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

Multiple Correspondence Analysis (MCA) on the tea data

library(dplyr)
library(tidyr)
library(ggplot2)
library(FactoMineR)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

# Check the structure and dimensions of data
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
# Browse data
View(tea)

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "friends", "age_Q", "tearoom", "tea.time")

# select the 'keep_columns' to create a new dataset
new_tea <- select(tea, all_of(keep_columns))

# Check the summaries and structure of the data
str(new_tea)
## 'data.frame':    300 obs. of  8 variables:
##  $ Tea     : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How     : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how     : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar   : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ age_Q   : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ tea.time: Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
summary(new_tea)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                                                                     
##         friends      age_Q           tearoom            tea.time  
##  friends    :196   +60  :38   Not.tearoom:242   Not.tea time:131  
##  Not.friends:104   15-24:92   tearoom    : 58   tea time    :169  
##                    25-34:69                                       
##                    35-44:40                                       
##                    45-59:61
# visualize the dataset
pivot_longer(new_tea, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free")+geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

# multiple correspondence analysis
mca <- MCA(new_tea, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = new_tea, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.231   0.204   0.162   0.155   0.136   0.131   0.129
## % of var.             12.319  10.893   8.653   8.279   7.244   6.976   6.906
## Cumulative % of var.  12.319  23.212  31.865  40.144  47.389  54.364  61.270
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.117   0.113   0.101   0.094   0.093   0.076   0.068
## % of var.              6.246   6.005   5.403   5.032   4.973   4.077   3.651
## Cumulative % of var.  67.516  73.521  78.925  83.957  88.930  93.006  96.657
##                       Dim.15
## Variance               0.063
## % of var.              3.343
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.143  0.029  0.011 |  0.598  0.583  0.186 | -0.449
## 2                  |  0.131  0.025  0.009 |  0.962  1.509  0.467 | -0.438
## 3                  |  0.123  0.022  0.015 |  0.056  0.005  0.003 | -0.067
## 4                  | -0.766  0.847  0.546 |  0.085  0.012  0.007 | -0.353
## 5                  | -0.251  0.091  0.050 |  0.634  0.656  0.318 | -0.083
## 6                  | -0.519  0.389  0.255 |  0.248  0.101  0.058 | -0.266
## 7                  |  0.153  0.034  0.017 | -0.097  0.015  0.007 | -0.279
## 8                  |  0.440  0.279  0.086 |  0.611  0.609  0.167 | -0.736
## 9                  |  0.280  0.113  0.037 |  0.108  0.019  0.006 | -0.334
## 10                 |  0.661  0.630  0.170 |  0.288  0.135  0.032 |  0.087
##                       ctr   cos2  
## 1                   0.414  0.105 |
## 2                   0.394  0.097 |
## 3                   0.009  0.004 |
## 4                   0.256  0.116 |
## 5                   0.014  0.005 |
## 6                   0.146  0.067 |
## 7                   0.160  0.058 |
## 8                   1.112  0.242 |
## 9                   0.229  0.053 |
## 10                  0.016  0.003 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.921  11.324   0.278   9.113 |   0.596   5.368   0.116
## Earl Grey          |  -0.324   3.665   0.190  -7.535 |  -0.412   6.698   0.307
## green              |  -0.168   0.167   0.003  -1.019 |   1.075   7.782   0.143
## alone              |  -0.141   0.695   0.037  -3.313 |   0.060   0.141   0.007
## lemon              |   0.038   0.008   0.000   0.229 |  -0.700   3.295   0.060
## milk               |   0.082   0.077   0.002   0.735 |   0.234   0.706   0.015
## other              |   2.331   8.820   0.168   7.088 |  -0.368   0.248   0.004
## tea bag            |  -0.308   2.910   0.124  -6.091 |   0.201   1.396   0.053
## tea bag+unpackaged |   0.322   1.758   0.047   3.761 |  -0.609   7.117   0.169
## unpackaged         |   0.614   2.446   0.051   3.919 |   0.643   3.038   0.056
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                5.900 |  -0.287   1.567   0.027  -2.841 |
## Earl Grey           -9.579 |  -0.070   0.242   0.009  -1.621 |
## green                6.536 |   1.052   9.383   0.137   6.397 |
## alone                1.405 |   0.212   2.260   0.084   5.006 |
## lemon               -4.253 |   0.514   2.243   0.033   3.128 |
## milk                 2.090 |  -0.715   8.273   0.136  -6.375 |
## other               -1.119 |  -1.484   5.087   0.068  -4.511 |
## tea bag              3.968 |  -0.552  13.324   0.399 -10.924 |
## tea bag+unpackaged  -7.116 |   0.526   6.690   0.126   6.149 |
## unpackaged           4.107 |   1.234  14.082   0.208   7.881 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.280 0.324 0.145 |
## How                | 0.177 0.072 0.232 |
## how                | 0.131 0.189 0.443 |
## sugar              | 0.225 0.087 0.019 |
## friends            | 0.030 0.428 0.024 |
## age_Q              | 0.366 0.281 0.395 |
## tearoom            | 0.356 0.126 0.021 |
## tea.time           | 0.281 0.127 0.019 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_mca_var(mca, choice = "mca.cor", 
            repel = TRUE, 
            ggtheme = theme_minimal())

For this exercise, tea data from FactoMineR package is used. 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions).

There are 300 observations of 36 variables, of which we select 8 variables.

Agegroup +60 cluster with drinking black tea. Drinking green tea clusters with drinking tea without friends. Teabag tea is drank not in a tearoom and without additives. Earl Grey is drank with sugar.

Variables agegroup and tearoom are most correlated with dimension 1. Friends is most correlates with dimension 2.